Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators

Authors

  • Ernie Chan
  • Francisco D. Igual
Abstract

In our previous work, we presented a systematic methodology for parallelizing dense matrix computations that separates the code implementing a linear algebra algorithm from a runtime system that exploits parallelism; only relatively simple scheduling algorithms were used to parallelize a wide range of dense matrix computations. We later extended the runtime system to utilize multiple hardware accelerators, but constrained its use to a particular scheduling algorithm. In this paper, we develop a new domain-specific scheduling algorithm, motivated by queueing theory, that addresses load balance and data locality simultaneously. To apply this domain-specific scheduling algorithm when utilizing multiple hardware accelerators, we implement a new software-managed cache coherency mechanism that allows any scheduling algorithm to be used. We provide performance results validating that our domain-specific scheduling algorithm consistently attains exceptional performance on both a homogeneous multicore platform and a heterogeneous platform with multiple hardware accelerators.
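
As a rough illustration of the scheduling decision described above, the sketch below is a minimal example in C under assumptions of my own: the names (task_t, device_queue_t, enqueue) and the cost weighting are hypothetical and are not the paper's runtime API or its queueing-theoretic policy. It assigns each ready task to the accelerator whose queue minimizes a combined cost of pending work (load) and blocks not yet resident on that device (locality).

    /* Hypothetical sketch: pick a per-accelerator ready queue for a task by
     * combining load (queue length) with locality (how many of the task's
     * matrix blocks the device already holds in its software-managed cache). */
    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_DEVICES 4
    #define NUM_BLOCKS  128   /* total matrix blocks (assumed)          */
    #define MAX_BLOCKS  8     /* blocks touched by a single task        */

    typedef struct {
        int blocks[MAX_BLOCKS];   /* ids of blocks the task reads/writes */
        int nblocks;
    } task_t;

    typedef struct {
        int  length;                  /* tasks currently waiting (load)      */
        bool cached[NUM_BLOCKS];      /* block ids resident on this device   */
    } device_queue_t;

    static device_queue_t queues[NUM_DEVICES];

    /* cost = pending work + penalty for every block that must be transferred;
     * the weight of 2 on misses is an arbitrary assumption for illustration */
    static int cost(const device_queue_t *q, const task_t *t)
    {
        int misses = 0;
        for (int i = 0; i < t->nblocks; i++)
            if (!q->cached[t->blocks[i]])
                misses++;
        return q->length + 2 * misses;
    }

    /* assign the task to the device with the lowest combined cost */
    static int enqueue(const task_t *t)
    {
        int best = 0;
        for (int d = 1; d < NUM_DEVICES; d++)
            if (cost(&queues[d], t) < cost(&queues[best], t))
                best = d;
        queues[best].length++;
        for (int i = 0; i < t->nblocks; i++)   /* device will now cache these */
            queues[best].cached[t->blocks[i]] = true;
        return best;
    }

    int main(void)
    {
        task_t a = { .blocks = { 0, 1 }, .nblocks = 2 };
        task_t b = { .blocks = { 1, 2 }, .nblocks = 2 };  /* shares block 1 with a */
        printf("task a -> device %d\n", enqueue(&a));
        printf("task b -> device %d\n", enqueue(&b));  /* block-1 reuse wins over load */
        return 0;
    }

The paper's software-managed cache coherency mechanism would additionally have to track block ownership across devices and invalidate stale copies; that bookkeeping is omitted from this sketch.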

Similar Articles

Runtime Data Flow Scheduling of Matrix Computations

We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. Well-known scheduling algorithms such as work stealing have proven time and space bounds, but these bounds do not provide a discerna...
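
As a point of contrast, here is a minimal single-threaded simulation of the work-stealing policy mentioned above, written in C with hypothetical names; production implementations use concurrent lock-free deques. An idle worker pops from the tail of its own deque and otherwise steals from the head of a random victim, which balances load but never consults which worker last touched a task's data.

    /* Single-threaded simulation of work stealing (hypothetical names): each
     * worker owns a deque; an idle worker pops from its own tail, otherwise
     * it steals from the head of a random victim. Nothing here considers
     * data locality. */
    #include <stdio.h>
    #include <stdlib.h>

    #define WORKERS 4
    #define CAP     16

    typedef struct { int tasks[CAP]; int head, tail; } deque_t;

    static deque_t dq[WORKERS];

    static void push(int w, int task) { dq[w].tasks[dq[w].tail++] = task; }

    static int pop_own(int w)                 /* LIFO pop from own tail */
    {
        if (dq[w].tail == dq[w].head) return -1;
        return dq[w].tasks[--dq[w].tail];
    }

    static int steal(int thief)               /* FIFO steal from a victim's head */
    {
        int victim = rand() % WORKERS;
        if (victim == thief || dq[victim].head == dq[victim].tail) return -1;
        return dq[victim].tasks[dq[victim].head++];
    }

    int main(void)
    {
        for (int t = 0; t < 8; t++) push(0, t);   /* all work starts on worker 0 */
        for (int step = 0; step < 16; step++) {
            int w = step % WORKERS;
            int task = pop_own(w);
            if (task < 0) task = steal(w);        /* idle: try one random victim */
            if (task >= 0)
                printf("worker %d runs task %d\n", w, task);
        }
        return 0;
    }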

Scheduling Dataflow Execution Across Multiple Accelerators

Dataflow execution engines such as MapReduce, DryadLINQ and PTask have enjoyed success because they simplify development for a class of important parallel applications. Expressing the computation as a dataflow graph allows the runtime, and not the programmer, to own problems such as synchronization, data movement and scheduling, leveraging dynamic information to inform strategy and policy in a w...

Sparse direct solvers with accelerators over DAG runtimes

The current trend in high-performance computing shows a dramatic increase in the number of cores on shared-memory compute nodes. Algorithms, especially those related to linear algebra, need to be adapted to these new computer architectures in order to be efficient. PaStiX is a sparse parallel direct solver that incorporates a dynamic scheduler for strongly hierarchical modern architec...

Enabling Legacy Applications on Heterogeneous Platforms

In this paper, we make the case for a runtime technique to seamlessly execute legacy applications on heterogeneous platforms consisting of CPUs and accelerators. We consider discrete as well as integrated heterogeneous platforms. In the former, the CPU and accelerators have different memory systems; in the latter, accelerators share physical memory with the CPU. Our proposed runtime does not require...

An Overview of the RAPID Run-time System for Parallel Irregular Computations

RAPID is a run-time system that uses an inspector/executor approach to parallelize irregular computations by embodying graph scheduling techniques to optimize interleaved communication and computation with mixed granularities. It provides a set of library functions for specifying irregular data objects and tasks that access these objects, extracts a task dependence graph from data access patter...
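
RAPID's actual library interface is not reproduced here; the following C sketch, with hypothetical names (add_task, inspect, execute), only illustrates the general inspector/executor pattern the abstract describes: tasks declare the objects they read and write, the inspector derives a dependence graph from those access patterns, and the executor runs tasks once their predecessors have completed.

    /* Hypothetical inspector/executor sketch (not RAPID's real API): tasks
     * declare the objects they read and write, the inspector derives
     * dependences from those access patterns, and the executor runs tasks
     * in dependence order. */
    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_TASKS 8
    #define MAX_OBJS  8

    typedef struct {
        const char *name;
        bool reads[MAX_OBJS], writes[MAX_OBJS];
    } task_t;

    static task_t tasks[MAX_TASKS];
    static int ntasks;
    static bool depends[MAX_TASKS][MAX_TASKS];   /* depends[j][i]: j waits on i */

    static int add_task(const char *name)
    {
        tasks[ntasks].name = name;
        return ntasks++;
    }

    /* inspector: a later task depends on an earlier one if their accesses
     * conflict (write/write, write/read, or read/write) on any object */
    static void inspect(void)
    {
        for (int j = 1; j < ntasks; j++)
            for (int i = 0; i < j; i++)
                for (int o = 0; o < MAX_OBJS; o++)
                    if ((tasks[i].writes[o] && (tasks[j].reads[o] || tasks[j].writes[o])) ||
                        (tasks[i].reads[o] && tasks[j].writes[o]))
                        depends[j][i] = true;
    }

    /* executor: repeatedly run any task whose predecessors have all finished */
    static void execute(void)
    {
        bool done[MAX_TASKS] = { false };
        for (int finished = 0; finished < ntasks; ) {
            for (int j = 0; j < ntasks; j++) {
                if (done[j]) continue;
                bool ready = true;
                for (int i = 0; i < ntasks; i++)
                    if (depends[j][i] && !done[i]) ready = false;
                if (ready) {
                    printf("run %s\n", tasks[j].name);
                    done[j] = true;
                    finished++;
                }
            }
        }
    }

    int main(void)
    {
        int a = add_task("factor");    /* writes block 0                */
        tasks[a].writes[0] = true;
        int b = add_task("solve");     /* reads block 0, writes block 1 */
        tasks[b].reads[0] = true;
        tasks[b].writes[1] = true;
        inspect();
        execute();                     /* runs factor, then solve       */
        return 0;
    }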

Journal:

Volume:   Issue:

Pages: -

Publication date: 2010